Capturing the semantic structure of documents using summaries in Supplemented Latent Semantic Analysis
Abstract
Latent Semantic Analysis (LSA) is a mathematical technique for capturing the semantic structure of documents from correlations among the textual elements within them. Summaries of documents contain the words that actually contribute to the concepts of those documents. In the present work, summaries are used in LSA together with supplementary information, such as document category and domain information. This modification is referred to as Supplemented Latent Semantic Analysis (SLSA) in this paper. SLSA is used to capture the semantic structure of documents from summaries of various proportions instead of the full-length documents. The performance of SLSA on summaries is evaluated empirically in a document classification application by comparing its classification accuracy against that of plain LSA on full-length documents. The results show empirically that summaries can be used in place of full-length documents to capture the semantic structure of documents.
Key-Words: Dimensionality Reduction, Document Classification, Latent Semantic Analysis, Semantic Structure, Singular Value Decomposition.
Related resources
Query expansion based on relevance feedback and latent semantic analysis
Web search engines are among the most popular tools on the Internet and are widely used by expert and novice users alike. Constructing an adequate query that best specifies the user's information need to the search engine is an important concern for web users. Query expansion is a way to reduce this concern and increase user satisfaction. In this paper, a new method of query expa...
Generating Coherent Extracts of Single Documents Using Latent Semantic Analysis
Tristan Miller, Master of Science, Graduate Department of Computer Science, University of Toronto, 2003. A major problem with automatically-produced summaries in general, and extracts in particular, is that the output text often lacks textual coherence. Our goal is to improve the textual coherence of automatically produc...
Obtaining Single Document Summaries Using Latent Dirichlet Allocation
In this paper, we present a novel approach that uses topic models based on Latent Dirichlet Allocation (LDA) to generate single-document summaries. Our approach is distinguished from other LDA-based approaches in that we identify the summary topics which best describe a given document and extract sentences only from those paragraphs within the document which are highly correlated give...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing the semantic concepts of text has motivated researchers to use...
Multi-layered Summarization of Spo Information Extraction and S
Spoken documents are difficult to display on screen and difficult to retrieve and browse. It is therefore important to develop technologies that summarize entire archives of the huge quantities of spoken documents in network content, to help users browse and retrieve them. In this paper we propose a complete set of multi-layered technologies to handle at least some ...
Publication date: 2015